Measuring Text Reuse in a Journalistic Domain
Authors
Abstract
This paper describes a general framework for measuring text reuse. The term describes how content from one or more known sources is reused either verbatim (word-for-word copy) or rewritten, depending upon factors influencing the creation of a new document. These factors may include reduction or increase in length, change of style, simplification of content, shift in audience focus, or disguise of the original source. Text from a document that can, with confidence, be shown to have been rewritten from a source or sources is called derived. Accurately measuring the amount of reused text between documents would find applications in automatic plagiarism detection, detecting breaches of intellectual copyright, and automatic evaluation of machine translation software. One area with such a need is reuse monitoring and control in journalism. A major UK newswire, the Press Association, would like to know how much of the text it produces is reused by the British press. A simple classifier using word overlap is tested; it correctly identifies 75% of wholly-derived, 53% of partially-derived and 71% of non-derived newspaper articles.
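As an illustration of what a word-overlap classifier of this kind might look like, the following is a minimal Python sketch. It is not the classifier evaluated in the paper: the tokenizer, the containment score, the function names and the two thresholds (0.8 and 0.4) are all assumptions chosen for the example.

```python
import re


def tokens(text):
    """Lowercase word tokens; a deliberately simple, assumed tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def containment(candidate, source):
    """Share of distinct candidate words that also occur in the source."""
    cand = set(tokens(candidate))
    if not cand:
        return 0.0
    return len(cand & set(tokens(source))) / len(cand)


def classify(candidate, source, wholly=0.8, partially=0.4):
    """Map the overlap score to a derivation class.

    The thresholds are hypothetical; in practice they would be tuned
    on annotated newspaper/newswire pairs.
    """
    score = containment(candidate, source)
    if score >= wholly:
        return "wholly-derived"
    if score >= partially:
        return "partially-derived"
    return "non-derived"
```

Here `candidate` would be a newspaper article and `source` the corresponding newswire text. The score is asymmetric on purpose: it asks how much of the newspaper text is covered by the source, not the reverse.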
Similar resources
Postgraduate Transfer Report.PDF
This thesis builds upon our current understanding of text reuse by proposing a hypothetical framework of text reuse and applying this abstract definition to a specific domain, that of journalistic reuse. The framework aims to explore a suitable measure of reuse and determine suitable discriminators for document derivation. Although text can be reused verbatim (word-for-word), in most cases, tex...
The METER Corpus: A corpus for analysing journalistic text reuse
As a part of the METER (MEasuring TExt Reuse) project we have built a new type of comparable corpus consisting of annotated examples of related newspaper texts. Texts in the corpus were manually collected from two main sources: the British Press Association (PA) and nine British national newspapers that subscribe to the PA newswire service. In addition to being structured to support efficient s...
Building and annotating a corpus for the study of journalistic text reuse
In this paper we present the METER Corpus, a novel resource for the study and analysis of journalistic text reuse. The corpus consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers. In some cases the newspaper stories are rewritten from the PA source; in other ...
Using the XARA XML-Aware Corpus Query Tool to Investigate the METER Corpus
The METER (MEasuring TExt Reuse) corpus is a corpus designed to support the study and analysis of journalistic text reuse. It consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers, some of which were derived from the PA version and some of which were written i...
Text Reuse Detection using a Composition of Text Similarity Measures
Detecting text reuse is a fundamental requirement for a variety of tasks and applications, ranging from journalistic text reuse to plagiarism detection. Text reuse is traditionally detected by computing similarity between a source text and a possibly reused text. However, existing text similarity measures exhibit a major limitation: They compute similarity only on features which can be derived ...